Abstract

We consider the problem of providing service guarantees in a high-speed packet switch.
As basic requirements, the switch should be scalable to high speeds per port, a large number of
ports and a large number of traffic flows with independent guarantees. Existing scalable solutions
are based on Virtual Output Queuing, which is computationally complex when required
to provide service guarantees for a large number of flows.

We present a novel architecture for packet switching that provides support for such service
guarantees. A cost-effective fabric with small external speedup is combined with a feedback
mechanism that enables the fabric to be virtually lossless, thus avoiding packet drops indiscriminate
of flows. Through analysis and simulation, we show that this architecture provides
accurate support for service guarantees, has low computational complexity and is scalable to
very high port speeds.

1 Introduction

High-speed communication between businesses has accounted for a large share of the telecommunications
market in recent years. This communication needs to be of high quality, secure and reliable.
Traditionally, these services were provided using ATM and Frame Relay technologies, but at a premium
cost. Recent advances in traffic engineering and the advent of Voice over IP technologies provide 
an opportunity to carry all enterprise traffic (voice, streaming and non-real-time data) at a lower 
cost. Virtual Private Networks (VPNs) El and Virtual Private LAN Services (VPLS) are two 
examples of such network services. A main requirement for such services is to provide quality 
of service (QoS) guarantees. Interactive media such as VoIP need low delay and low loss, while other
traffic needs minimum throughput guarantees.

In this paper we consider the problem of providing such guarantees in a high-speed, cost- 
effective switch at the interface (edge) between enterprise and service provider networks. At a 
minimum, the switch is required to provide three types of service: Premium, Assured and Best 
Effort 0,11111. Premium service provides low loss and small delay for a flow sending within a pre- 
determined rate limit (anything above the limit is discarded). Assured service guarantees delivery 
for traffic within a limit, but allows and forwards extra traffic within a higher limit if transmit 
opportunities are available. 

A provider edge switch is required to differentiate between traffic from different customers 
(here called flows) and provide separate guarantees to each flow. A requirement is to support a 
large number (in the order of hundreds or even thousands) of such flow guarantees per port, where 
each port must support speeds in the order of several Gbps. Traffic from one customer (flow) 
can enter through one or multiple ingress ports and exit through one or multiple ports. On the 
other hand, to come up with practical solutions, we assume that the provided service guarantees 
only need to be enforced over timescales in the order of a few milliseconds, which is enough for 
most applications, thereby alleviating the traditional requirement that service guarantees have to be 
enforced over timescales as small as a single packet transmission time. We consider the problem 
of providing 1-to-1 and N-to-1 services (or "Pipe" and "Funnel scope" as defined in lH^I), as 1-to-N
and N-to-N can be provided as combinations of services of the first two kinds. In the case
of Assured N-to-1 service, it is also desirable to provide a fair distribution of service among the N
components of the flow.

Current state-of-the-art switch architectures are based on Virtual Output Queuing (VOQ), which 
requires a fabric speedup s > 2 and a matching algorithm to find which packets are sent into the 
fabric at each fabric cycle. However, realizing a speed-up of s > 2 may be impractical at very 
high line speeds (> 10 Gbps) given the limitations on memory access speeds. Furthermore, even 
though some of the VOQ architectures can support service guarantees, a major problem is that 
the matching algorithms have high complexity, are run at each fabric cycle, and all virtual output 
queues at all input lines in the system need to participate in a centralized algorithm lEUIl . 

To provide a low-complexity switch architecture that fulfills the above requirements, we observe
that the main cause of high complexity in current architectures resides in the necessity of
addressing congestion at an output line. Short-term congestion can be absorbed by buffers, whereas
long-term congestion results in packet loss. We also observe that many measurement studies (for
example [17]) have shown that traffic in the Internet is dominated by the TCP protocol, which
accounts for about 90% of all traffic. A salient feature of TCP is that packet transmission is
controlled by a congestion avoidance algorithm ifTSll, [24]. As an effect, the average sending rate of a
TCP flow is a decreasing function of drop probability and of round trip time (see [22] for a
quantitative evaluation of this function). In practice, TCP flows have a stable (long-term) operation
when the drop probability is between 0 and 0.1, corresponding to loss rates less than 10%, and very
rarely operate above 0.2 [22]. Heavy long-term congestion that results in a drop probability above
0.2 can be produced by non-TCP (and more generally, non-congestion-controlled) traffic such as
multimedia traffic over UDP.

Our proposed architecture, named "Feedback Output Queuing" (FOQ), exploits these observations
by efficiently supporting fast fabrics with relatively slow output memory interfaces and hence
a small effective speedup. For example, a speedup of 1.25 at the fabric-to-line interface is sufficient
to maintain an output drop probability up to 0.2 for traffic flows fully utilizing this interface. For
higher levels of long-term congestion (e.g., drop probability above 0.2), the FOQ architecture uses
a feedback mechanism to reduce the traffic volume before it enters the switch fabric. This FOQ
mechanism provides support for the Assured service, 1-to-1 and N-to-1 scope.

As far as Premium traffic is concerned, given that rate guarantees are ensured to be within 
switch capacity by some admission control procedure, policing Premium traffic at its guaranteed 
rate at the ingress guarantees that Premium traffic cannot create congestion in the absence of other 
types of traffic. Thus, Premium service can be provided through a simple priority scheduling in 
OUT ports and fabric, bypassing the FOQ mechanism. 

In the following we show through analysis and simulation studies that the proposed FOQ
architecture can alleviate congestion at the output lines of an output queued switch with slow output
memory interface, and can thus provide deterministic QoS guarantees. FOQ requires only a modest
speedup (e.g., 1.3) at the output interface of the switch. The congestion control algorithm in the
FOQ architecture is fully parallelized at the input and output lines, requiring O(1) complexity at
each input and output line. This low complexity enables implementation of the FOQ architecture
at very high line rates (> 10 Gbps).

The rest of the paper is organized as follows. In the next section we discuss the related work in
more detail. Then, we give a detailed description of the FOQ architecture in Section 3. In Section 4
we develop an analytical model for FOQ, based on a PI controller, and analyze its performance
under step-shaped traffic bursts, before introducing a quantized version of a PI controller. We
present our simulation results in Section 5 and conclude the paper with a comparison between
FOQ and VOQ in the final section.

2 Related Work 

Several switch architectures with QoS capabilities have been proposed in the literature, with par- 
ticular advantages and shortcomings. 

An early architecture is Output Queuing (OQ). An OQ switch having N inputs and N outputs
with each line of speed c bits/second requires a switching fabric of speed Nc, i.e., a speedup s = N.
In this case, no congestion occurs at the inputs or at the fabric, only at the output lines. To manage
congestion and provide QoS support, a set of queues and a scheduling mechanism is implemented
at each output. The main advantage of this architecture is that it can provide QoS support with
simple mechanisms of queuing and scheduling, but the main problem is that the fabric speedup of
N can be impractical. In fact, current technology enables fast interconnection networks operating
at current high-speed line rates and with a typical number of lines (for example c = 10 Gbps and
N = 16), but writing the packets coming out of the interconnection network into output buffers at
high speeds remains a problem. In other words, although the fabric may have an internal speedup
of N, the effective speedup seen at an output buffer is limited by the memory write speed, which is
usually much less.

An alternative to OQ is Virtual Output Queuing (VOQ) lUl, lITSl . which requires a smaller 
fabric speedup, such as s in the range between 2 and 4. Unlike OQ, VOQ requires a matching 
algorithm to find which packets will be sent into the fabric at each fabric cycle. There are quite 
a few such algorithms proposed in the literature, which are based on Parallel Iterative Matching, 
Time Slot Assignment, Maximal Matching, or Stable Matching (see [20] and references therein).
Some of these algorithms can also support service guarantees. The advantage of VOQ is its ability 
to switch high speed lines with low fabric speedup. However its main problem is that the matching 
algorithms are complex (0(M^A^^) where M is the number of independent service guarantees per 
port, N is the number of ports), have to be run at each fabric cycle, and all VOQs at all input
lines in the system need to participate in a centralized algorithm. We note that Output Queued
switches can also be perfectly emulated by Combined Input-Output Queued (CIOQ) switches with
a speed-up s > 2 [5]. Unfortunately, the arbitration algorithm has a computational complexity of
0{N^), which can be reduced to 0{N), but in that case, the space complexity becomes linear in 
the number of cells in the switch. Therefore, emulating an OQ switch by a CIOQ switch or a VOQ 
switch appears to have limited scalability. 

In recent years, these potential scalability concerns have been addressed by implementing a
very small number of independent service guarantees. Under the Differentiated Services framework
[3], flows are aggregated in M = 6 classes, and service guarantees are offered for classes.
The downside is that the realized QoS per flow has a lower level of assurance (higher probability
of violating the desired service level) than the QoS per aggregate [13], [25]. Moreover, recently
proposed VPN and VLAN services [23], [4] require per-VPN or per-VLAN QoS guarantees. All the
above are arguments in favor of implementing a number of independent service guarantees per port
much larger than six.

More recent proposals [16] decrease the time interval between two runs of the matching
algorithm, but with a tradeoff in increased burstiness and additional scheduling algorithms for
mitigating unbounded delays. Moreover, the service presented in [15] is of type Premium 1-to-1, but
cannot provide Assured N-to-1 service.

Last, similar to the FOQ architecture proposed in this paper, the IBM Prizma switch architecture
[19] uses a shared memory and no centralized arbitration algorithm. However, Prizma
relies on on-off flow control, while the feedback scheme proposed in the present paper dynamically
controls the amount of traffic admitted into the fabric. In addition, FOQ feedback is based on the
state of the output queues, while Prizma relies on the state of internal switch queues. Both the origin of
the information and the dynamic control of the drop level lead us to believe that FOQ can use the
capacity available in the switch more efficiently.

3 Feedback Output Queuing Architecture 

We consider a switch as in Figure 1 with a fabric having an internal speedup of N and an internal
buffer capability.¹ We also assume that the fabric has one or a very small number of queues per
port. In the following we present an architecture for providing per-flow service guarantees where
the number of flows per port M is large, that is, $M \gg 1$.

Packets enter through a set of N input ports of speed c. As a packet is received at port i, a
destination port j is determined by a routing module, its QoS flow k is determined by a classifier,
and an IN dropper determines if the packet is discarded. If not discarded, the packet is transmitted
to the fabric through a line of speed sc. We assume a fabric with internal speed of Nsc, i.e., at each
fabric cycle one packet from each IN line can be moved to an OUT line while sustaining speeds of
sc from all IN lines. Multiple (up to N) packets can be received at an OUT line in one cycle, and
in that case the packets are placed in a fabric queue $FQ_j$ corresponding to the destination line j.

¹ This fabric has a cost-effective implementation using shared memory technology. The case of zero/small memory
fabric with no/small internal speedup is a separate problem, and we report our study elsewhere.

Figure 1: Detailed FOQ switch architecture



Packets are forwarded by the OUT line j at speed sc, separated into OUT queues $\{OQ_{j,k}\}_k$
based on their QoS flow, and scheduled for transmission to OUT port j of speed c. The OUT
scheduling implements various service guarantees such as priority, minimum rate guarantee,
maximum rate limit, and maximum delay guarantee. This OUT scheduling results in a certain service rate
(in general variable in time) for each OUT queue.

If traffic to $OQ_{j,k}$ has a rate higher than the current service rate of flow k, packets accumulate
in this queue and some of them may be dropped by a queue management mechanism such as
drop-tail or RED (see [9] for details). If the traffic to all queues at OUT line j amounts to an aggregate
rate above sc, then packets accumulate at the fabric queue $FQ_j$. If this situation persists, $FQ_j$
fills and packets get dropped in the fabric. In this case, QoS guarantees for some flow k may be
violated since fabric drops do not discriminate between different flows.

We define the relative congestion at a queue as

\[ C = 1 - \frac{r_O}{r_I}, \tag{1} \]

where $r_I$ and $r_O$ are the traffic rates input to and output from the queue, respectively. It is easy to see
that, as long as the traffic coming out of OUT line j is such that the relative congestion $C_{j,k}$ at each
queue in $\{OQ_{j,k}\}_k$ is below a threshold $d_{max} < 1 - 1/s$, and the OUT port j is utilized at its full
capacity c, then the traffic throughput at the interface of fabric to OUT line j is below sc, and thus
there is no congestion at that interface and no fabric drop. Indeed, each queue then receives traffic at
a rate $r_I = r_O / (1 - C_{j,k}) < s\, r_O$, and the output rates $r_O$ of all queues at port j sum to c.
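
The bound $d_{max} < 1 - 1/s$ can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names and the two-queue example values are illustrative assumptions, and the code only evaluates the threshold for a given speedup and the aggregate input rate $\sum r_O / (1 - C)$ for a set of OUT queues.

```python
def max_relative_congestion(speedup: float) -> float:
    """Upper bound 1 - 1/s on the per-queue relative congestion threshold."""
    if speedup <= 1.0:
        raise ValueError("speedup must exceed 1")
    return 1.0 - 1.0 / speedup


def fabric_to_line_rate(out_rates, congestions):
    """Aggregate rate entering an OUT line: each queue receives r_I = r_O / (1 - C)."""
    return sum(r_o / (1.0 - c) for r_o, c in zip(out_rates, congestions))


# Hypothetical example: speedup s = 1.25, port speed c = 10 Gbps,
# two OUT queues whose service rates add up to the port speed.
s, c = 1.25, 10.0
d_bound = max_relative_congestion(s)            # close to 0.2
load = fabric_to_line_rate([4.0, 6.0], [0.19, 0.15])
print(f"d_max bound = {d_bound:.3f}, fabric-to-line load = {load:.3f} Gbps, limit = {s * c} Gbps")
```

With both queues kept below the bound, the load offered to the OUT line stays under sc, which is the condition for avoiding fabric drops.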

In the FOQ architecture, a feedback mechanism is introduced to control the relative congestion 
at each OUT queue below a threshold. When the relative congestion at an OUT queue increases, 
the feedback mechanism instructs the input modules to drop a part of the traffic destined to this 
queue. By keeping the traffic below a congestion threshold, the fabric drop is avoided. Thus, packets
are dropped only from those flows that create congestion, and the QoS guarantees are provided to
all flows as configured.

It is worth noting that the flows having packets dropped at ingress by FOQ would have packets
dropped in the same amount at egress in the case of an ideal Output Queuing with speedup of
N. Thus, FOQ reduces the demand of fabric throughput by eliminating the need for forwarding
packets that are later discarded.

Realizations of FOQ We next consider options for a practical realization of the FOQ architec- 
ture. More precisely, we consider implementations of FOQ as a discrete feedback control system. 
A certain measure of congestion is sampled at intervals of duration T at each OUT queue. A 
control algorithm computes a drop indication based on the last sample and an internal state, and 
transmits it to all IN modules. There, packets of the indicated class are randomly dropped with a 
probability that is a function of the drop indication. 

We have several ways to measure the congestion at a queue. A simple method is to compute
the average drop probability at the queue during the sampling interval:

\[ DropProb(T) = DroppedPkts(T) / InPkts(T). \]

Another measure is the relative congestion during the interval T, similar to (1):

\[ RelCong(T) = 1 - OutPkts(T) / InPkts(T). \]

Observe that, unlike the drop probability, the relative congestion takes into account the variation of
the queue size during T. Since the FOQ objective is to keep the traffic rate at the fabric interface
below a critical level, it is apparent that the relative congestion is more effective in controlling that
traffic rate. This is confirmed by the model in Section 4 and the simulation in Section 5.
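
The difference between the two measures is easiest to see on an interval during which the queue grows without dropping anything. The sketch below is illustrative only: the function names and the packet counts are assumptions, and an empty interval is arbitrarily reported as zero congestion.

```python
def drop_prob(dropped_pkts: int, in_pkts: int) -> float:
    """Average drop probability over one sampling interval."""
    return dropped_pkts / in_pkts if in_pkts else 0.0


def rel_cong(out_pkts: int, in_pkts: int) -> float:
    """Relative congestion over one sampling interval (negative while the queue drains)."""
    return 1.0 - out_pkts / in_pkts if in_pkts else 0.0


# Hypothetical interval: 1000 packets arrive, 800 are served, none are dropped yet,
# so the queue grows by 200 packets.
print(drop_prob(dropped_pkts=0, in_pkts=1000))   # the drop probability sees no congestion
print(rel_cong(out_pkts=800, in_pkts=1000))      # the relative congestion already reflects the backlog
```

Here the drop probability stays at zero until the queue overflows, whereas the relative congestion reacts as soon as the arrival rate exceeds the service rate.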

We consider a discrete Proportional-Integral (PI) controller [10] for the feedback control algorithm. In
Section 4 we derive its configuration from stability conditions. The PI algorithm outputs a value of
drop probability between 0 and 1, transmitted to the IN droppers every interval.

An implementation issue is the data rate of feedback transmission. Considering K classes at
each of the N OUT ports and that the drop information is coded in F bits, the total feedback data
rate is KNF/T. For example, for K = 1000, N = 32, F = 8, T = 1 ms, the feedback data
rate is 256 Mb/s. It is possible to reduce this rate by reducing the precision of the feedback data,
and thus its encoding. In an extreme case, the feedback has three values: increase, decrease or
keep the same drop level. All IN modules use this indication in conjunction with a pre-defined table
of drop levels. We call this the "Gear-Box algorithm" (GB), model it in Section 4 and show its
performance in Section 5.
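
The feedback rate KNF/T is simple arithmetic; the short sketch below reproduces the example above and evaluates the same configuration with a 2-bit signal, which is enough to encode three values. The function name is an assumption made for illustration.

```python
def feedback_rate_bps(num_classes: int, num_ports: int, bits_per_value: int, interval_s: float) -> float:
    """Total feedback data rate K * N * F / T in bits per second."""
    return num_classes * num_ports * bits_per_value / interval_s


full_precision = feedback_rate_bps(1000, 32, 8, 1e-3)   # K = 1000, N = 32, F = 8, T = 1 ms
three_valued = feedback_rate_bps(1000, 32, 2, 1e-3)     # same switch, 2-bit feedback
print(f"{full_precision / 1e6:.0f} Mb/s vs {three_valued / 1e6:.0f} Mb/s")
```

Reducing the encoding from 8 bits to 2 bits cuts the feedback rate by a factor of four for the same sampling interval.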

4 A Control Theoretical Model for the GB Algorithm 

In this section we develop an analytical model for the FOQ architecture by a control theoretical 
approach. In our analysis, we use a classical discrete PI controller to adjust the drop rate of each 
flow. We simplify our analysis by assuming only a single flow at first, and later discuss how and 
under what conditions our results may apply to the general multi-flow case. We also assume in 
our analysis that there is no limitation to the capacity of the feedback channel in the system. We 
then show that an efficient algorithm for limited-capacity feedback channels can be obtained by 
quantizing the control decisions of the PI controller, which we call the Gear Box algorithm. 

The basic control structure at a particular OUT port j and for a particular flow k is shown
in Figure 2. If there are a total of K flows in each OUT port, then each OUT port has K such
controllers. All variables we use in this section are for the aggregate traffic in flow k originating
from all IN ports and destined to OUT port j, unless we note otherwise (i.e., we don't use the
subscript (j, k) for notational convenience). $\lambda$ is the total arrival rate for traffic destined for the
OUT queue $OQ_{j,k}$. A total portion, p, of the arriving traffic is dropped at the IN droppers, and the
surviving portion goes into the fabric queue $FQ_j$ at a rate $u = \lambda - p$. This traffic shares the fabric
queue with other traffic destined to OUT line j, and then it is delivered to OUT dropper (j, k) at a
rate r. In the analysis we assume the fabric queue is sufficiently large, so that there are no drops
due to queue overflow.

Figure 2: FOQ architecture.

The total drop rate, p, is adjusted by a controller (how p is distributed among the N IN droppers
is not relevant for this analysis; we explain how we implement the actual drop mechanism in the
next section). The purpose of the controller is to keep the fabric output rate for packets destined to
$OQ_{j,k}$ at a desired level, $r_{opt}$. The desired rate can be chosen according to the current rate out of
$OQ_{j,k}$:

\[ r_{opt} = \alpha s\, r_O(j,k), \]

where $\alpha$ is a constant smaller than but close to 1. In this way the desired rate will be close to the
capacity, sc, of the fabric output line when the OUT queue $OQ_{j,k}$ is the only busy queue and utilizing
the entire speed of port j. Furthermore it will be reduced in proportion to the service rate of $OQ_{j,k}$
when multiple OUT queues are contending for the OUT port. The two nonlinearities in the figure
simply state that the drop rate cannot be negative or greater than the arrival rate $\lambda$. In our analysis
we assume that the controller is operating in the linear region, and ignore the nonlinearities.

The delay T between the output of the controller and the arrival rate models a zero-order hold
at the controller output. The controller operates on the time-average of the error signal taken over an
interval T, rather than the signal itself, and modifies its output only at intervals of T. In the rest of
this section we denote the time-average of a signal x(t) over the period T by the discrete notation
x[n]. For example, the time-average of the fabric output rate is given by

\[ r[n] = \frac{1}{T} \int_{nT}^{(n+1)T} r(t)\,dt. \]

When the system is in steady state, the amount of traffic, q, in the fabric queue destined to $OQ_{j,k}$
does not change significantly during the interval T. Therefore, we can approximate the average
fabric output rate by

\[ r[n] \approx \frac{1}{T} \int_{nT}^{(n+1)T} u(t)\,dt = \lambda[n] - p[n-1]. \tag{2} \]

For a discrete PI controller the drop rate for the next interval is calculated using the error between
the average fabric output rate, r[n], and the desired fabric output rate, $r_{opt}[n]$,

\[ p[n] = K e[n] + K_I \sum_{m=0}^{n} e[m]
        = K \left( r[n] - r_{opt}[n] \right) + K_I \left( \sum_{m=0}^{n} r[m] - \sum_{m=0}^{n} r_{opt}[m] \right). \]

We can now investigate the step response of the system, setting $\lambda[n] = \lambda_0$ and $r_{opt}[n] = r_{opt}$
for $n \ge 0$, for the case of a single flow. The magnitude of the arrival rate can in general be larger
than the maximum fabric output rate, i.e., $\lambda_0 > sc$. In this case the fabric output will be constant
at r[n] = sc for an initial period $0 \le n < N_0$. During this period the fabric queue will always be
non-empty and the controller cannot sense the actual magnitude of the arrival rate. Therefore the
controller output will increase linearly,

\[ p[n] = K (sc - r_{opt}) + (n+1) K_I (sc - r_{opt}). \]

The fabric queue size, measured at the end of each period, will increase until the drop rate reaches
$\lambda_0 - sc$ and then decrease back to zero:

\[ q_n = T \sum_{m=0}^{n} \left( \lambda_0 - sc - p[m-1] \right)
       = T \left[ (n+1)(\lambda_0 - sc) - n K (sc - r_{opt}) - \frac{n(n+1)}{2} K_I (sc - r_{opt}) \right]. \tag{3} \]
The duration of this initial period, $N_0$, and the maximum queue size can easily be calculated from
this quadratic equation by setting $q_{N_0 - 1} = 0$. To find the behavior of the system for $n \ge N_0$ we use a
new time axis, $n' = n - N_0$, with an initial condition for the accumulator memory:

\[ p[n'] = K \left( r[n'] - r_{opt}[n'] \right)
         + K_I \left( \sum_{m=0}^{n'} r[m] - \sum_{m=0}^{n'} r_{opt}[m] \right) + S_{N_0}, \tag{4} \]

where

\[ S_{N_0} = K_I N_0 (sc - r_{opt}). \]
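
The linear ramp of the controller output and the quadratic queue growth during the initial period can be reproduced with a small fluid simulation. The sketch below is an illustration under stated assumptions, not the authors' simulator: the function names are hypothetical, the fabric is modeled as a single queue drained at rate sc, and the parameter values (rates in Gbps, T in ms, hence queue sizes in Mbit) are chosen only for the example.

```python
def simulate_initial_period(lam0, sc, r_opt, K, K_I, T, steps):
    """Fluid model of one flow: IN drop, fabric queue, discrete PI controller."""
    q = 0.0        # fabric backlog at the end of the previous interval
    p_prev = 0.0   # drop rate in force during the current interval, p[n-1]
    err_sum = 0.0  # integrator state: sum of e[m] for m <= n
    trace = []
    for _ in range(steps):
        u = max(lam0 - p_prev, 0.0)    # rate admitted into the fabric
        r = min(sc, u + q / T)         # average fabric output rate r[n]
        q = q + (u - r) * T            # backlog at the end of this interval
        e = r - r_opt                  # error seen by the controller
        err_sum += e
        p = K * e + K_I * err_sum      # discrete PI law
        p = min(max(p, 0.0), lam0)     # the two nonlinearities of the control loop
        trace.append((r, p, q))
        p_prev = p
    return trace


def closed_form(n, lam0, sc, r_opt, K, K_I, T):
    """Controller output and queue size predicted for the initial period."""
    gap = sc - r_opt
    p = (K + (n + 1) * K_I) * gap
    q = T * ((n + 1) * (lam0 - sc) - n * K * gap - 0.5 * n * (n + 1) * K_I * gap)
    return p, q


# Assumed example values: arrival step of 20, sc = 12.8, r_opt = 0.95 * sc.
params = dict(lam0=20.0, sc=12.8, r_opt=12.16, K=0.1, K_I=0.5, T=1.0)
trace = simulate_initial_period(steps=10, **params)
for n, (r, p, q) in enumerate(trace):
    p_cf, q_cf = closed_form(n, **params)
    print(n, round(p, 4), round(p_cf, 4), round(q, 4), round(q_cf, 4))
```

As long as the fabric queue is non-empty the simulated values coincide with the closed-form expressions; running the simulation for more intervals shows the queue draining once the drop rate exceeds the excess arrival rate.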

Equations (2) and (4) describe a closed-loop control system. We show in the appendix that the
two poles of this system are at

\[ z_{1,2} = -\frac{K + K_I - 1}{2} \pm \frac{1}{2} \sqrt{(K + K_I - 1)^2 + 4K}. \]

It follows that we have the stability condition given by the proposition below.

Proposition 1. The closed-loop system described by (2) and (4) is stable iff

\[ 0 < K_I < 2(1 - K). \tag{5} \]

Proof. If $K + K_I > 1$ then $|z_2| > |z_1|$, and both poles are inside the unit circle iff

\[ K + K_I - 1 + \sqrt{(K + K_I - 1)^2 + 4K} < 2, \]

which yields

\[ K + \frac{K_I}{2} < 1. \]

On the other hand, if $K + K_I < 1$ then $|z_2| < |z_1|$, and both poles are inside the unit circle iff

\[ -(K + K_I - 1) + \sqrt{(K + K_I - 1)^2 + 4K} < 2, \]

which yields

\[ K_I > 0. \]

Combining the two cases gives the condition for stability. □
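
The proposition can be cross-checked numerically by computing the two poles and comparing their magnitudes against the closed-form condition. This is a minimal sketch under the assumption $0 \le K < 1$; the function names and the sampled grid of gains are illustrative choices, with the grid offset so that no point falls exactly on a stability boundary.

```python
import cmath


def poles(K: float, K_I: float):
    """Roots of z^2 + (K + K_I - 1) z - K = 0, the closed-loop characteristic equation."""
    a = K + K_I - 1.0
    root = cmath.sqrt(a * a + 4.0 * K)
    return (-a + root) / 2.0, (-a - root) / 2.0


def stable_by_poles(K: float, K_I: float) -> bool:
    """Stability test from the definition: both poles strictly inside the unit circle."""
    return all(abs(z) < 1.0 for z in poles(K, K_I))


def stable_by_condition(K: float, K_I: float) -> bool:
    """Stability test from the proposition: 0 < K_I < 2 (1 - K)."""
    return 0.0 < K_I < 2.0 * (1.0 - K)


mismatches = [
    (K, K_I)
    for K in (0.05 + 0.1 * i for i in range(10))
    for K_I in (-0.47 + 0.1 * j for j in range(30))
    if stable_by_poles(K, K_I) != stable_by_condition(K, K_I)
]
print("grid points where the two tests disagree:", len(mismatches))
```

On this grid the two tests agree everywhere, which is consistent with the stability region stated in (5).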
In the appendix we solve the system with the stability condition (5) and show that the controller
output is given by

\[ p[n] = \begin{cases}
   \left[ K + (n+1) K_I \right] (sc - r_{opt}), & n < N_0 \\
   A_1 z_1^{\,n - N_0} + A_2 z_2^{\,n - N_0} + D, & n \ge N_0
   \end{cases} \tag{6} \]

where $A_1$ and $A_2$ are constants determined by the initial conditions at $n = N_0$, and

\[ D = \lambda_0 - r_{opt} \]

is the difference between the arrival and the desired rates. We observe that after the initial linear
increase, the drop rate approaches exponentially the difference between the arrival and the desired
rates. Furthermore, since the absolute value of the negative pole is relatively larger for $K_I > 1 - K$,
the system will show more oscillatory behavior in this case compared to the $K_I < 1 - K$ case.

Multiple flows When there are multiple flows, the analysis for the initial period ($n < N_0$) needs
to be updated. Let v be the total rate of the traffic that does not belong to flow k but is destined to
port j. If the step size for flow k is such that $\lambda + v > sc$ then for an initial period the average fabric
output rate for flow k is approximately

\[ r[n] = sc\, \frac{u[n]}{v[n] + u[n]}. \]

Since r is not constant anymore, the previous results for the initial period do not apply in general.
However, once the transient is over and u and v are adjusted so that $u[n] + v[n] < sc$, the
approximation (2) holds, and the results for the single-flow case can be used replacing $S_{N_0}$ by a new initial
condition. We defer a detailed analysis of the initial transient period for the multi-flow case to a
future study. However, in two cases, when u or v is negligible compared to the other, the results
for the single-flow case can be used with some changes. If $u \gg v$, then $r[n] \approx sc$ and we can
approximate the multiple-flow case by the single-flow case. On the other hand, if $u \ll v$ then we
can assume that v is constant since the effect of the new traffic, u, will be negligible. Therefore

\[ r[n] \approx sc\, \frac{u[n]}{v} = \sigma u[n] \]

with $\sigma = sc/v$ during the initial period $n < N_0$. In this case $N_0$ is defined by

\[ \lambda_0 - p[N_0 - 1] + v = sc. \]

For $n < N_0$ the drop rate can be calculated by replacing (2) with

\[ r[n] \approx \sigma \left( \lambda[n] - p[n-1] \right). \]

The response for $n \ge N_0$ is still given by (6) but with a new initial condition replacing $S_{N_0}$.

Quantized PI - the Gear Box algorithm A practical implementation of the discrete-time PI
control described above requires a few modifications to the control loop. The first modification
is related to how the bytes will actually be dropped at the desired drop rate calculated by the
controller. The drop rate has to be divided fairly among the N IN droppers. Furthermore it is
well known that dropping consecutive packets may result in poor performance in the affected flows.
Therefore it is desirable to spread the drop rate over an interval and to introduce some randomness
into the drop process. For these reasons we introduce a packet drop probability, P[n], which is
updated at intervals of T according to the desired drop rate p[n] and the estimated average arrival rate,

\[ P[n] = \frac{p[n]}{\lambda[n+1]} = \frac{(1 - P[n-1])\, p[n]}{r[n]}. \tag{7} \]

Note that here we used the fabric output rate divided by the admit probability (i.e., $1 - P[n-1]$)
as an estimate of the next average arrival rate. This is justified for the cases where the average
arrival rate is a slowly varying function relative to the interval T and the delay.

The second modification to the feedback structure is related to the constraint on the size of the
feedback channel, which becomes a limiting factor on the precision of the feedback signal at high
speeds. Our goal is to use only a finite number of drop probability values, and to derive a controller
that will have a performance similar to that of the PI controller. For this purpose we expand the
drop probability update (7) as

\[ P[n] = \frac{1}{\lambda[n+1]} \left( K e[n] + K_I \sum_{m=0}^{n} e[m] \right)
        = \frac{1}{\lambda[n+1]} \left( K e[n-1] + K_I \sum_{m=0}^{n-1} e[m] + K e[n] + K_I e[n] - K e[n-1] \right). \]

Using again the assumption $\lambda[n+1] \approx \lambda[n]$, we can rewrite the above equation as

\begin{align*}
P[n] &\approx P[n-1] + \frac{K e[n] + K_I e[n] - K e[n-1]}{\lambda[n+1]} \\
     &= P[n-1] + \frac{1 - P[n-1]}{r[n]} \left( K e[n] + K_I e[n] - K e[n-1] \right) \\
     &= \left( 1 - \frac{(K + K_I) e[n] - K e[n-1]}{r[n]} \right) P[n-1] + \frac{(K + K_I) e[n] - K e[n-1]}{r[n]}.
\end{align*}

Now, if we define

\[ \delta[n] = \frac{(K + K_I) e[n] - K e[n-1]}{r[n]} \]

then the update for the drop probability simply becomes

\[ P[n] = (1 - \delta[n])\, P[n-1] + \delta[n]. \]

In order to use finite values of P[n] we quantize $\delta[n]$ to three levels

\[ \delta_q[n] = \begin{cases}
   \beta, & \delta[n] > \Delta_{max} \\
   0, & -\Delta_{min} \le \delta[n] \le \Delta_{max} \\
   \dfrac{\beta}{\beta - 1}, & \delta[n] < -\Delta_{min}
   \end{cases} \tag{8} \]

Then the update for discrete probability values becomes

\[ P_q[n] = (1 - \delta_q[n])\, P_q[n-1] + \delta_q[n], \]

which can also be written as an update of admit probabilities as

\[ 1 - P_q[n] = (1 - \delta_q[n]) (1 - P_q[n-1]). \]
If we set K = 0, then (8) can also be expressed in terms of the relative congestion
$C[n] = 1 - r_O[n] / r[n]$ as

\[ \delta_q[n] = \begin{cases}
   \beta, & C[n] > d_{max} \\
   \dfrac{\beta}{\beta - 1}, & C[n] < d_{min} \\
   0, & \text{otherwise}
   \end{cases} \]

where

\[ d_{max} = 1 - \frac{1}{\alpha s} + \frac{\Delta_{max}}{\alpha s K_I} \]

and

\[ d_{min} = 1 - \frac{1}{\alpha s} - \frac{\Delta_{min}}{\alpha s K_I}. \]

We call the quantized mechanism with K = 0 the Gear Box (GB) controller, since there are
only three possible actions: increase the drop probability, decrease the drop probability, and no
change. With the GB controller it is sufficient to have a 2-bit feedback signal every T seconds.
Furthermore the different levels of the admit probabilities are the different powers of $(1 - \beta)$.
Therefore the calculation at the IN droppers can be implemented by storing the powers
$(1 - \beta)^i$, $i = 0, 1, 2, \ldots$,
as a table in the memory and just updating a pointer to this table based on the feedback signal.
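
The table-and-pointer implementation can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the class and function names, the table size, and the choice to saturate the pointer at both ends of the table are assumptions. The feedback signal is encoded as +1 (increase the drop level), -1 (decrease) or 0 (keep), which fits in two bits.

```python
class GearBoxDropper:
    """IN-side state: a pointer into a table of admit probabilities (1 - beta)**i."""

    def __init__(self, beta: float, levels: int):
        self.table = [(1.0 - beta) ** i for i in range(levels)]
        self.index = 0  # level 0 admits everything

    def apply_feedback(self, signal: int) -> None:
        """Move the pointer by +1, -1 or 0; saturating at the table ends is an assumption."""
        self.index = min(max(self.index + signal, 0), len(self.table) - 1)

    @property
    def drop_probability(self) -> float:
        return 1.0 - self.table[self.index]


def gear_box_feedback(rel_cong: float, d_min: float, d_max: float) -> int:
    """OUT-side decision for K = 0: compare the relative congestion with the two thresholds."""
    if rel_cong > d_max:
        return +1
    if rel_cong < d_min:
        return -1
    return 0


# Hypothetical run: beta = 0.08 with thresholds d_min = 0.02 and d_max = 0.17.
dropper = GearBoxDropper(beta=0.08, levels=32)
for congestion in (0.20, 0.20, 0.10, 0.01):
    dropper.apply_feedback(gear_box_feedback(congestion, d_min=0.02, d_max=0.17))
    print(f"C = {congestion:.2f} -> drop probability {dropper.drop_probability:.4f}")
```

Each pointer increment multiplies the admit probability by $(1 - \beta)$ and each decrement divides it by $(1 - \beta)$, which matches the admit-probability form of the quantized update.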

To increase the stability of the control loop, in our implementation of the GB algorithm, we
choose the value for $\beta$ such that the relative congestion after a step increase or decrease in IN drop
probability is equal. To find the value for $\beta$ that has this property, we note that when the relative
congestion C reaches $d_{max}$, the drop step is increased, and the relative congestion immediately
changes to a different value $C_{new,1}$. More precisely, if we have

\[ 1 - \frac{r_O}{r_I} = d_{max}, \]

then $r_I$ changes to $r_{I,new} = r_I (1 - \beta)$, so

\[ C_{new,1} = 1 - \frac{r_O}{r_I (1 - \beta)}, \]

which can be rewritten as

\[ C_{new,1} = 1 - \frac{1 - d_{max}}{1 - \beta}. \]

Likewise, when C reaches $d_{min}$ the drop step is decreased and the relative congestion immediately
changes to a different value $C_{new,2}$. That is,

\[ 1 - \frac{r_O}{r_I} = d_{min} \]

has the effect of changing $r_I$ to $r_{I,new} = r_I / (1 - \beta)$, yielding

\[ C_{new,2} = 1 - \frac{r_O (1 - \beta)}{r_I}, \]

that is

\[ C_{new,2} = 1 - (1 - d_{min})(1 - \beta), \]

and we want to have $C_{new,1} = C_{new,2}$. Hence,

\[ \frac{1 - d_{max}}{1 - \beta} = (1 - d_{min})(1 - \beta), \]

which reduces to

\[ \beta = 1 \pm \sqrt{\frac{1 - d_{max}}{1 - d_{min}}}, \tag{9} \]

giving finally

\[ \beta = 1 - \sqrt{\frac{1 - d_{max}}{1 - d_{min}}} \]

as the value for $\beta$ such that the relative congestion after a step increase or decrease in IN drop
probability is equal.
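
As a worked example of this choice of step, the sketch below computes $\beta$ for the thresholds $d_{max} = 0.17$ and $d_{min} = 0.02$ used in the simulations of Section 5.1, and checks that the relative congestion lands on the same value after a step up and after a step down. The function names are assumptions made for illustration.

```python
import math


def gear_box_step(d_min: float, d_max: float) -> float:
    """Step beta that equalizes the relative congestion after an up-step and a down-step."""
    return 1.0 - math.sqrt((1.0 - d_max) / (1.0 - d_min))


def congestion_after_increase(d_max: float, beta: float) -> float:
    """Relative congestion right after the drop level is raised at C = d_max."""
    return 1.0 - (1.0 - d_max) / (1.0 - beta)


def congestion_after_decrease(d_min: float, beta: float) -> float:
    """Relative congestion right after the drop level is lowered at C = d_min."""
    return 1.0 - (1.0 - d_min) * (1.0 - beta)


d_min, d_max = 0.02, 0.17
beta = gear_box_step(d_min, d_max)
c_up = congestion_after_increase(d_max, beta)
c_down = congestion_after_decrease(d_min, beta)
print(f"beta = {beta:.4f}, after increase = {c_up:.4f}, after decrease = {c_down:.4f}")
```

For these thresholds the step is roughly 0.08, and both transitions bring the relative congestion back to the same intermediate level between $d_{min}$ and $d_{max}$.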

We illustrate the behavior of the system when subject to the configuration of (9) in Figure 3,
where $d_{mid} = 1 - \sqrt{(1 - d_{\min})(1 - d_{\max})}$. When the input rate increases such that the output
relative congestion goes from $d_{\min}$ to $d_{\max}$, the input drop probability remains at the same level,
and jumps to $P_1$ when the output relative congestion reaches $d_{\max}$. This jump in the input drop
probability has the immediate effect of causing the output relative congestion to decrease to a value
$d_{mid}$. Then, if the output relative congestion increases again to $d_{\max}$, the input drop probability
remains at $P_1$ before jumping to $P_2$ when the output relative congestion reaches $d_{\max}$. Now, if the
input drop probability is at $P_2$, and the relative congestion decreases from $d_{mid}$ to $d_{\min}$, the input
drop probability remains at $P_2$, and jumps down to $P_1$ as soon as the relative congestion reaches
$d_{\min}$. The decrease in the input drop probability from $P_2$ to $P_1$ immediately increases the output
relative congestion to $d_{mid}$.

Figure 3: FOQ dynamics and stability

As shown in Figure 3, this configuration has the key advantage of providing hysteresis to the
GB control, by always trying to bring the relative congestion back to $d_{mid}$, thereby providing
stability against small perturbations. We use this configuration in the simulations presented in
the following.
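
As a quick numerical check of this configuration, the following minimal Python sketch evaluates the
step $\beta$ of Eq. (9) and the resulting $d_{mid}$ for the thresholds used in the simulations
($d_{\min} = 0.02$, $d_{\max} = 0.17$). The function names are illustrative assumptions and not part of
FOQ; the formulas assume the relative congestion is $d = 1 - r_O / r_I$, as in the derivation of (9).

```python
import math


def gearbox_beta(d_min: float, d_max: float) -> float:
    """Gear-box step of Eq. (9): beta = 1 - sqrt((1 - d_max) / (1 - d_min))."""
    return 1.0 - math.sqrt((1.0 - d_max) / (1.0 - d_min))


def congestion_after_increase(d_max: float, beta: float) -> float:
    """Relative congestion right after the drop step goes up (r_I scaled by 1 - beta)."""
    return 1.0 - (1.0 - d_max) / (1.0 - beta)


def congestion_after_decrease(d_min: float, beta: float) -> float:
    """Relative congestion right after the drop step goes down (r_I divided by 1 - beta)."""
    return 1.0 - (1.0 - d_min) * (1.0 - beta)


def gearbox_d_mid(d_min: float, d_max: float) -> float:
    """Common landing point d_mid = 1 - sqrt((1 - d_min)(1 - d_max))."""
    return 1.0 - math.sqrt((1.0 - d_min) * (1.0 - d_max))


if __name__ == "__main__":
    d_min, d_max = 0.02, 0.17  # thresholds quoted in the simulation section
    beta = gearbox_beta(d_min, d_max)
    print("beta                 :", round(beta, 4))
    print("d after step increase:", round(congestion_after_increase(d_max, beta), 4))
    print("d after step decrease:", round(congestion_after_decrease(d_min, beta), 4))
    print("d_mid                :", round(gearbox_d_mid(d_min, d_max), 4))
```

With these thresholds both landing points coincide with $d_{mid} \approx 0.098$, for a step of roughly
$\beta \approx 0.08$.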

5 Simulation Experiments 

The objective of this section is to present a set of experimental results that illustrate the salient 
properties of FOQ. First, we describe a relatively simple experiment with three classes of traffic 
and constant-bit-rate (CBR) traffic, before presenting experimental results gathered for a more 
realistic situation where traffic consists of a large number of non-synchronized TCP sources. 

Figure 4: Throughput plots

5.1 FOQ and Service Guarantees 

We simulate a 16x10 Gbps-port switch with a 5 MB shared memory fabric having external speedup 
s = 1.28, 2 MB drop-tail OUT queues per flow, and no ingress queues. The FOQ-GB mechanism 
has a sampling rate T = 1 ms and feedback thresholds dmax = 0.17, dmin = 0.02. We run each 
simulation for 200 ms. 

The offered load is composed of three flows sending at constant rates starting at t = 0: flow 0 at
0.952 Gbps, and flows 1 and 2 at 9.52 Gbps each, all ingressing on separate ports and exiting on the
same port. Given that the total offered load is 20 Gbps, the OUT port has a potential 200% overload.
The required guarantee for flow 0 is Premium service (0.952 Gbps rate guarantee), and minimum
rate guarantees of 7.75 Gbps and 1.3 Gbps are required for flows 1 and 2 respectively. Flow 0
is assigned to Fabric queue 0 at high priority, and flows 1 and 2 to Fabric queue 1 at lower
priority. At the OUT scheduler, each flow is assigned a separate queue. Queue 0 is scheduled at
high priority, whereas queues 1 and 2 are scheduled at lower priority in a Weighted Fair Queuing
discipline between them with 6 : 1 weights, corresponding to the required rate guarantees.
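
The 6 : 1 weights can be related to the required guarantees with a short calculation. The sketch
below assumes (our reading, not stated explicitly in the text) that the high-priority queue is served
first at its 0.952 Gbps arrival rate and that the two lower-priority queues split the remaining
capacity of the 10 Gbps port in proportion to their WFQ weights whenever both are backlogged.
The function and parameter names are illustrative.

```python
def wfq_shares(port_rate_gbps, priority_rate_gbps, weights):
    """Split the capacity left over by the strict-priority queue by WFQ weight.

    Assumes every weighted queue is backlogged, so each one receives exactly
    its weight-proportional share of the residual rate.
    """
    residual = port_rate_gbps - priority_rate_gbps
    total_weight = sum(weights)
    return [residual * w / total_weight for w in weights]


if __name__ == "__main__":
    # 10 Gbps port, 0.952 Gbps premium flow, 6:1 weights for the other two queues
    share_1, share_2 = wfq_shares(10.0, 0.952, [6, 1])
    print("queue 1 share (Gbps):", round(share_1, 3))
    print("queue 2 share (Gbps):", round(share_2, 3))
```

Under these assumptions the shares come out near 7.75 Gbps and 1.29 Gbps, close to the required
guarantees of 7.75 Gbps and 1.3 Gbps.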

In Figure 4, we plot the evolution in time of the service rate for the three flows, without and
with FOQ respectively. In Figure 5, we show the dynamics of the drop rate for the same scenarios. In
all plots, each datapoint corresponds to an average over a sliding window of size 1 ms. Flow 0 is
serviced at its arrival rate in both cases, due to its high priority assignment in the fabric and OUT
scheduler. But the rate received by flow 1 in the non-FOQ case, 5.93 Gbps (Figure 4(a)), is below
its requirement. This is due to the drop in fabric queue 1 (Figure 5(c)) without discrimination
between flows 1 and 2. When using FOQ (Figure 4(b)), flow 1 receives 7.62 Gbps and flow 2
1.37 Gbps, thus both achieving their minimum rate guarantees. This is explained by the FOQ
action reflected in Figure 5(b), where we see an increase of input drop for flows 1 and 2 as a
reaction to output congestion. As a consequence, the fabric drop is zero almost all the time in the
FOQ case, in contrast with the high drop rate in the base case. The spike in fabric drop is due
to the transient state where ingress drop is increasing but not yet sufficient for eliminating fabric
congestion. With FOQ, fabric drop occurs only at bursts with high rate and long duration. It can
be mitigated by larger fabric memory or higher frequency of feedback. Also note that flow 0 is not
affected even during the FOQ transient due to its assignment to the high priority fabric queue.

Figure 5: Drop rate plots

Figure 6: Delay plots

In Figure 6, we show the dynamics of packet transit delay through the whole switch. While
flow 0 receives minimum delay in both cases due to its high priority assignment, flows 1 and
2 experience delays that are proportional to their respective service rates (their OUT queues are
close to full in the steady state due to the drop-tail queue management).

5.2 FOQ Dynamics with TCP Traffic 

Next, we examine the interaction of FOQ-GB with TCP traffic. To that effect, we run a simulation
where 4,500 TCP sources send traffic through a switch. In this experiment, we only consider
one class of traffic. Four subnets containing 1,000 TCP sources each and one subnet containing
500 TCP sources are connected to the switch by five independent 1 Gbps links. All sources send
traffic to the same destination subnet, which is also connected to the switch by a 1 Gbps link, with
a one-way propagation delay of 20 ms. The number of active TCP flows increases over
time as follows. Each source in the first subnet starts sending traffic at a time drawn uniformly
at random between t = 0 s and t = 1 s. Then, each source in the second subnet starts sending
traffic between t = 2 s and t = 3 s. Subsequently, every two seconds, sources in an additional
subnet start transmitting. Hence, we have no overload between t = 0 s and t = 2 s, a potential 2:1
overload in the fabric between t = 2 s and t = 4 s, a 3:1 overload between t = 4 s and t = 6 s,
a 4:1 overload between t = 6 s and t = 8 s, and a 5:1 overload from then on. There is a potential s:1
bottleneck at the output port of the switch governing the 1 Gbps link to the destination subnet after
t = 2 s. All TCP sources send 1,040-byte packets.

Figure 7: Ingress drops and fabric queue. FOQ manages to maintain a low fabric queue by dropping
packets at the input links. When FOQ is not present, there are no input drops.
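
The overload schedule of this scenario can be summarized in a few lines. The sketch below is only an
illustration of the arithmetic: it counts the subnets that have begun their one-second start-up
window and divides their aggregate access capacity by the capacity of the destination link. The
helper names and default arguments are assumptions introduced here.

```python
def active_subnets(t_s: float, spacing_s: float = 2.0, n_subnets: int = 5) -> int:
    """Number of subnets whose sources have begun starting at time t_s (seconds).

    A new subnet begins its start-up window every spacing_s seconds, from t = 0.
    """
    if t_s < 0:
        return 0
    return min(n_subnets, int(t_s // spacing_s) + 1)


def potential_overload(t_s: float, link_gbps: float = 1.0, out_gbps: float = 1.0) -> float:
    """Ratio of aggregate access capacity of the active subnets to the output link rate."""
    return active_subnets(t_s) * link_gbps / out_gbps


if __name__ == "__main__":
    for t in (1.0, 3.0, 5.0, 7.0, 9.0):
        print(f"t = {t:.0f} s: potential overload {potential_overload(t):.0f}:1")
```

The ratio is only a potential overload: during each start-up window not all sources of the newest
subnet are active yet, and TCP adapts its sending rate to the losses it observes.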

The FOQ parameters are chosen as in the previous experiment, i.e., $s = 1.28$, $d_{\max} = 0.17$
and $d_{\min} = 0.02$. The fabric queue now has a size of 500 KB and the output queue has a size of
400 KB. The output queue runs RED, with $max_p = 0.5$, $max_{TH} = 300$ KB, $min_{TH} = 100$ KB,
a sampling time of 1 ms, and a weight $w_q = 0.1$. We compare the performance of the switch with
and without FOQ.
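
For readers unfamiliar with RED, the following minimal sketch shows the textbook RED computation
(an exponentially weighted average of the queue length, then a drop probability that grows
linearly between the two thresholds) with the parameters listed above. It omits refinements such as
the inter-drop count correction, and the exact RED variant used by the simulator is not specified in
the text, so this is an assumption. The class and attribute names are illustrative, and 1 KB is taken
as 1,000 bytes.

```python
from dataclasses import dataclass


@dataclass
class Red:
    """Textbook RED state; thresholds in bytes (1 KB taken as 1,000 bytes)."""
    min_th: float = 100e3   # lower threshold, 100 KB
    max_th: float = 300e3   # upper threshold, 300 KB
    max_p: float = 0.5      # drop probability reached at max_th
    w_q: float = 0.1        # weight of the newest queue sample
    avg: float = 0.0        # averaged queue length

    def update(self, queue_bytes: float) -> float:
        """Fold one queue-length sample (e.g. taken every 1 ms) into the average."""
        self.avg = (1.0 - self.w_q) * self.avg + self.w_q * queue_bytes
        return self.avg

    def drop_probability(self) -> float:
        """Zero below min_th, linear up to max_p at max_th, one at or above max_th."""
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        return self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)


if __name__ == "__main__":
    red = Red()
    for _ in range(100):          # hold the queue at 200 KB for 100 samples
        red.update(200e3)
    print("average queue (bytes):", round(red.avg))
    print("drop probability     :", round(red.drop_probability(), 3))
```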

We first observe in Figure 7(b), where each datapoint represents a moving average over a
sliding window of size 50 ms, that, regardless of the potential overload, FOQ consistently manages
to maintain the fabric backlog extremely close to zero, by dropping packets at the input links. As
illustrated in Figure 7(a), input drops increase with the overload. Conversely, without FOQ, and
therefore in the absence of input drops, the fabric buffer fills up as the number of active TCP
sources grows, and is eventually completely full once all sources have started transmitting. Ultimately, as
illustrated in Figure 8(a), traffic is dropped in the fabric. There are no fabric drops when FOQ is
used.

Figure 8: Fabric and output losses. FOQ manages to completely avoid fabric losses, and also significantly
reduces the amount of traffic dropped at the output link.

Last, we observe in Figure 8(b) that the output loss rate is limited by $1 - 1/s \approx 21.8\%$ when
FOQ is disabled. On the other hand, FOQ maintains the egress relative congestion close to $d_{mid} =
0.098$, as shown in Figure 9(a), and consequently, the output loss rate remains close to 9.8%. When
the loss rates become roughly constant, the output queue length, represented in Figure 9(b), also
becomes constant, by virtue of a stable RED control ^21- 
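
One way to read the $1 - 1/s$ bound is as a simple rate argument: with speedup $s$ the fabric can
deliver at most $s$ times the line rate to an output port, which drains at the line rate, so at most
a fraction $(s - 1)/s$ of the delivered traffic can be lost there. The short check below evaluates it
for the speedup used in the experiments; the function name is an assumption introduced here.

```python
def max_output_loss_fraction(speedup: float) -> float:
    """Upper bound 1 - 1/s on the loss fraction at an output fed at s times its line rate."""
    if speedup < 1.0:
        raise ValueError("speedup must be at least 1")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    bound = max_output_loss_fraction(1.28)
    print("loss bound without FOQ:", round(bound, 4))
```

For $s = 1.28$ the bound is about 0.219, so the operating point $d_{mid} = 0.098$ maintained by FOQ is
less than half of it.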

As a conclusion to this second experiment, we have shown that FOQ's objectives of preventing 
fabric drops and regulating the traffic that arrives at the output link were met in the case of an 
experiment with a large number of TCP sources. The results were even more positive than those 
obtained with constant-rate sources, as FOQ does not exhibit transient behaviors in this scenario. 
This can be justified by the fact that FOQ feedback is run at a much higher frequency (every 
T = 1 ms) than the TCP congestion control algorithms, which are run with an approximately 
40-ms delay here. 

Figure 9: Relative congestion and output queue. FOQ maintains the relative congestion between $d_{\min}$
and $d_{\max}$.

6 Discussion and Conclusions 



In this paper we presented the Feedback Output Queuing architecture for packet switching that 
provides support for service guarantees when the switching speed is limited by the memory read 
and write speeds. Using a fast switching fabric in this case leads to a build-up in fabric buffers 
and eventually either to buffer overflow and packet discarding or to unbounded delays at the fabric 
inputs due to backpressure. The FOQ architecture solves this problem by triggering packet discard 
only from flows that exceed their allocated bandwidth, and therefore limiting the build-up and 
delay at the fabric buffers. In the worst case the arrival rate will be $\lambda_{\max}$, the total input capacity
of the fabric. For the PI controller the maximum fabric queue size and the maximum delay in the
fabric can be calculated by setting the arrival rate to $\lambda_{\max}$ in the corresponding queue-size
expression. Any delay value above this number
can be deterministically guaranteed to a flow by using a proper scheduler (e.g. WFQ-based) at the 
output queues after the fabric. 

An alternative approach to solve the same problem is to use VOQ at fabric inputs. Recent 
studies show that VOQ can also provide deterministic delay bounds II2TI . This is however at the 
expense of computational complexity. VOQ algorithms require 0{N'^) computations per packet 
slot to determine which packets will be sent to their destinations. This high computational
complexity makes the VOQ approach less feasible for high bit-rate switches. In contrast, FOQ
requires a total of $O(N)$ computations per packet slot and $O(KN)$ computations per feedback
interval, where $K$ is the number of supported classes. Since the feedback interval is much larger
than a packet slot, the computations for the feedback are negligible. Furthermore, the
computations are distributed over the inputs and outputs, so that each input and output performs $O(1)$
computations. In other words, FOQ's computational complexity is much lower than that of VOQ, the
current state of the art.

We applied discrete feedback control theory to derive a stable configuration for FOQ. Through 
analysis and simulations we showed that a quantized version of a PI controller named "Gear-Box 
control" is stable, responds quickly to traffic bursts and provides highly accurate QoS guarantees. 

We believe that this work opens many avenues for future research. There is a range
of control algorithms to be investigated besides those presented here. The interaction between the
TCP congestion control algorithm and FOQ (and RED queue management) is an interesting control
problem. The FOQ architecture can be extended with a set of input queues in order to provide zero
loss for a wider range of bursty traffic, given a limited fabric memory size.

Appendix

In this appendix we give a detailed derivation of some of the equations.
Taking the z-transforms of the PI controller equations, we get

$$p(z) = K \left( R(z) - R_{opt}(z) \right) + K_I \frac{z}{z - 1} \left( R(z) - R_{opt}(z) \right) + S_{N_0}(z) \qquad (10)$$

and

$$R(z) = \lambda(z) - z^{-1} p(z). \qquad (11)$$

Transfer functions of this system between the output rate, $R$, and the two inputs and initial state,
$\lambda$, $R_{opt}$, and $S_{N_0}$, are given by

$$\frac{R(z)}{\lambda(z)} = \frac{z(z - 1)}{z^2 + (K + K_I - 1)z - K},$$

$$\frac{R(z)}{R_{opt}(z)} = \frac{(K + K_I)z - K}{z^2 + (K + K_I - 1)z - K},$$

and

$$\frac{R(z)}{S_{N_0}(z)} = \frac{1 - z}{z^2 + (K + K_I - 1)z - K}.$$

Let $z_1$ and $z_2$ be the two roots of the system characteristic equation, i.e.

$$z_{1,2}^2 + (K + K_I - 1) z_{1,2} - K = 0.$$

Then without loss of generality

$$z_1 = -\frac{K + K_I - 1}{2} + \frac{1}{2} \sqrt{(K + K_I - 1)^2 + 4K},$$

$$z_2 = -\frac{K + K_I - 1}{2} - \frac{1}{2} \sqrt{(K + K_I - 1)^2 + 4K}.$$

We showed in Proposition 1 that the system is stable if

$$0 < K_I < 2(1 - K).$$
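
The stability condition can be cross-checked numerically against the roots of the characteristic
equation $z^2 + (K + K_I - 1)z - K = 0$. The sketch below is a hedged illustration, not part of the
original analysis: it computes both roots, tests whether they lie strictly inside the unit circle, and
compares the outcome with the inequality $0 < K_I < 2(1 - K)$. It assumes $0 \le K < 1$ (Proposition 1 is
not reproduced here), and the gain values in the example are arbitrary.

```python
import cmath


def characteristic_roots(K: float, K_I: float):
    """Roots of z^2 + (K + K_I - 1) z - K = 0 via the quadratic formula."""
    a = K + K_I - 1.0
    disc = cmath.sqrt(a * a + 4.0 * K)
    return (-a + disc) / 2.0, (-a - disc) / 2.0


def roots_inside_unit_circle(K: float, K_I: float) -> bool:
    """Discrete-time stability test: both roots have magnitude below one."""
    return all(abs(z) < 1.0 for z in characteristic_roots(K, K_I))


def satisfies_condition(K: float, K_I: float) -> bool:
    """The inequality quoted from Proposition 1: 0 < K_I < 2 (1 - K)."""
    return 0.0 < K_I < 2.0 * (1.0 - K)


if __name__ == "__main__":
    for K, K_I in [(0.3, 0.5), (0.3, 1.5)]:   # arbitrary example gains
        z1, z2 = characteristic_roots(K, K_I)
        print(f"K={K}, K_I={K_I}: |z1|={abs(z1):.3f}, |z2|={abs(z2):.3f}, "
              f"stable={roots_inside_unit_circle(K, K_I)}, "
              f"condition={satisfies_condition(K, K_I)}")
```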

We next find the solution for the drop rate $p$ assuming this stability condition is satisfied. For step
inputs and initial condition, $\lambda(z) = z\lambda/(z - 1)$, $R_{opt}(z) = z r_{opt}/(z - 1)$, $S_{N_0}(z) = z S_{N_0}/(z - 1)$,
and defining $D = \lambda - r_{opt}$ as the difference between the arrival and the desired rates, we have from
(10) and (11):

$$p(z) = z^2 \, \frac{\left[ (K + K_I) D + S_{N_0} \right] z - K D - S_{N_0}}{(z - 1) \left( z^2 + (K + K_I - 1)z - K \right)}.$$

This can be written as a partial fraction expansion as

$$p(z) = D \left( \frac{z}{z - 1} - A_1 \frac{z}{z - z_1} + A_2 \frac{z}{z - z_2} \right),$$

where

$$A_1 = \frac{z_1^2 - \frac{S_{N_0}}{D} z_1}{z_1 - z_2}$$

and

$$A_2 = \frac{z_2^2 - \frac{S_{N_0}}{D} z_2}{z_1 - z_2},$$

which can be solved easily. Finally recall that this system was obtained initially by defining a new
time axis for $n > N_0$. Therefore after taking the inverse z-transform we combine the result with
the $n < N_0$ case to get

$$p[n] = \begin{cases} \left[ K + (n + 1) K_I \right] (sc - r_{opt}), & n < N_0 \\ D \left( 1 - A_1 z_1^{n - N_0} + A_2 z_2^{n - N_0} \right), & n > N_0 \end{cases}$$
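
The closed-form step response can be checked against a direct iteration of the controller. The
sketch below is a hedged illustration under our reading of (10) and (11) in the time domain, on the
shifted time axis that starts at $N_0$: the error is $e[n] = \lambda - p[n-1] - r_{opt} = D - p[n-1]$ with
$p[-1] = 0$, and $p[n] = K e[n] + K_I \sum_{k \le n} e[k] + S_{N_0}$. The function names, the argument `s0`
(standing for the constant $S_{N_0}$) and the example gains are assumptions introduced here.

```python
import math


def simulate_drop_rate(K, K_I, D, s0, steps):
    """Iterate the PI recurrence p[n] = K e[n] + K_I * sum(e[0..n]) + s0."""
    p_prev, acc, out = 0.0, 0.0, []
    for _ in range(steps):
        e = D - p_prev            # R[n] - r_opt with R[n] = lambda - p[n-1]
        acc += e                  # running sum of the error (integral term)
        p_prev = K * e + K_I * acc + s0
        out.append(p_prev)
    return out


def closed_form_drop_rate(K, K_I, D, s0, steps):
    """Evaluate D * (1 - A1 * z1**n + A2 * z2**n) for n = 0 .. steps-1.

    Requires K >= 0 (real, distinct roots) and D != 0.
    """
    a = K + K_I - 1.0
    root = math.sqrt(a * a + 4.0 * K)
    z1, z2 = (-a + root) / 2.0, (-a - root) / 2.0
    A1 = (z1 * z1 - (s0 / D) * z1) / (z1 - z2)
    A2 = (z2 * z2 - (s0 / D) * z2) / (z1 - z2)
    return [D * (1.0 - A1 * z1 ** n + A2 * z2 ** n) for n in range(steps)]


if __name__ == "__main__":
    K, K_I, D, s0 = 0.3, 0.5, 1.0, 0.1   # arbitrary gains satisfying 0 < K_I < 2(1 - K)
    sim = simulate_drop_rate(K, K_I, D, s0, 30)
    ref = closed_form_drop_rate(K, K_I, D, s0, 30)
    print("max |recurrence - closed form|:", max(abs(x - y) for x, y in zip(sim, ref)))
    print("final drop rate:", round(sim[-1], 6), "(target D =", D, ")")
```

For stable gains both sequences agree to rounding error and converge to $D$, i.e. the drop rate
settles at the excess of the arrival rate over the desired rate.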